1 Introduction

For many decades, statisticians have made attempts to prepare the Bayesian
omelette without breaking the Bayesian eggs; that is, to obtain probabilistic
likelihood-based inferences without relying on informative prior distributions.
A recent example is Murray Aitkin's book, Statistical Inference, which is the
culmination of a long research program on the topic of integrated evidence,
exemplified by the discussion paper of Aitkin (1991). The book, subtitled An
Integrated Bayesian/Likelihood Approach, proposes handling statistical
hypothesis testing and model selection via comparisons of posterior
distributions of likelihood functions under the competing models or via the
posterior distribution of the likelihood ratios corresponding to those models.
(The essence of the proposal is detailed in Section 2.) Instead of comparing
Bayes factors or performing posterior predictive checks (comparing observed
data to posterior replicated pseudo-datasets), Statistical Inference recommends
a fusion between likelihood and Bayesian paradigms that allows for the
perpetuation of noninformative priors in testing settings where standard
Bayesian practice prohibits their usage (DeGroot 1973) or requires an extended
decision-theoretic framework (Bernardo 2011). While we appreciate the
considerable effort made by Aitkin to place his theory within a Bayesian
framework, we remain unconvinced of its coherence, for reasons set out in this
note.

From our Bayesian perspective, and for several distinct reasons detailed in the
present note, integrated Bayesian/likelihood inference cannot fit within the
philosophy of Bayesian inference. Aitkin's commendable attempt at creating a
framework that incorporates the use of arbitrary noninformative priors in model
choice procedures is thus incoherent in this Bayesian respect. When using
improper priors leads to meaningless Bayesian procedures for posterior model
comparison, we see this as a sign that the Bayesian model will not work for the
problem at hand. Rather than trying at all costs to keep the offending model and
define marginal posterior probabilities by fiat (whether by BIC, DIC, intrinsic
Bayes factors, or posterior likelihoods), we prefer to follow the full logic of
Bayesian inference and recognize that, when one's Bayesian approach leads to a
dead end, one must change either one's methodologies or one's beliefs (or
both). Bayesians, both subjective and objective, have long recognized the need
for tuning, expanding, or otherwise altering a model in light of its
predictions (see, for example, Good 1950 and Jaynes 2003), and we view
undefined Bayes factors as an example where otherwise useful methods are being
extended beyond their applicability. To try to work around such problems
without altering the prior distribution is, we believe, an abandonment of
Bayesian principles and, more importantly, an abandoned opportunity for model
improvement.

The criticisms found in the current review are therefore not limited to
Aitkin's book; they also apply to previous patches such as the deviance
information criterion (DIC) of Spiegelhalter et al. (2002) (which also uses a
"posterior" expectation of the log-likelihood) and the pseudo-posteriors of
Geisser and Eddy (1979) (which make extensive use of the data in their product
of predictives).

Unlike the author, who has felt the call to construct a partly new, if
tentatively unifying, foundation for statistical inference, we have the luxury
of feeling that we already live in a comfortable (even if not flawless)
inferential house. Thus, we come to Aitkin's book not with a perceived need to
rebuild but rather with a view toward shoring up the potentially shaky pillars
that support our own inferences. A key question when looking at any method for
probabilistic inference that is not fully Bayesian is: For the applied problems
that interest us, does the proposed new approach perform better than our
existing methods? Our answer, at which we arrive after careful thought, is no.

As an evaluation of the ideas found in Statistical Inference, the criticisms 
found in this review are inherently limited. We do not claim here that Aitkin's
approach is wrong per se, merely that it does not fit within our inferential
methodology, namely Bayesian statistics, despite using Bayesian tools. We
acknowledge that statistical methods do not, and most likely never will, form
a seamless logical structure. It may thus very well be that the approach of 
comparing posterior distributions of likelihoods could be useful for some ac- 
tual applications, and perhaps Aitkin's book will inspire future researchers to 
demonstrate this. 

Statistical Inference begins with a crisp review of frequentist, likelihood and
Bayesian approaches to inference and then proceeds to the main issue:
introducing the "integrated Bayes/likelihood approach", first described in
Chapter 2. Much of the remaining methodological material appears in Chapters 4
("Unified analysis of finite populations") and 7 ("Goodness of fit and model
diagnostics"). The remaining chapters apply Aitkin's principles to various
examples. The present article discusses the basic ideas in Statistical
Inference, then considers the relevance of Aitkin's methodology within the
Bayesian paradigm.

2 A small change in the paradigm 
2.1 Posterior likelihood

"This quite small change to standard Bayesian analysis allows a very general
approach to a wide range of apparently different inference problems; a
particular advantage of the approach is that it can use the same noninformative
priors." Statistical Inference, page xiii

The "quite small change" advocated by Statistical Inference consists in
envisioning the likelihood function $L(\theta, x)$ as a generic function of the
parameter $\theta$ that can be processed a posteriori (that is, with a
distribution induced by the posterior $\pi(\theta \mid x)$), hence allowing for
a (posterior) cdf, mean, variance and quantiles. In particular, the central
tool for Aitkin's model fit is the "posterior cdf" of the likelihood,

$$F(z) = \Pr^{\pi}(L(\theta, x) > z \mid x).$$
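
To make the definition concrete, here is a minimal Monte Carlo sketch of $F(z)$. It is not taken from the book: the binomial likelihood, the uniform prior, and the function names `binomial_likelihood` and `posterior_likelihood_tail` are all illustrative assumptions.

```python
import math
import numpy as np


def binomial_likelihood(theta, n_success, n_trials):
    """L(theta, x) for x = n_success successes out of n_trials."""
    return (math.comb(n_trials, n_success)
            * theta**n_success * (1 - theta)**(n_trials - n_success))


def posterior_likelihood_tail(z, n_success, n_trials, n_draws=100_000, seed=0):
    """Monte Carlo estimate of F(z) = Pr(L(theta, x) > z | x).

    Assumes a uniform prior on theta, so that the posterior is
    Beta(n_success + 1, n_trials - n_success + 1).
    """
    rng = np.random.default_rng(seed)
    theta = rng.beta(n_success + 1, n_trials - n_success + 1, size=n_draws)
    return float(np.mean(binomial_likelihood(theta, n_success, n_trials) > z))
```

Calling `posterior_likelihood_tail(0.1, 3, 10)` estimates the posterior probability that the likelihood exceeds 0.1 after observing 3 successes in 10 trials. The likelihood is treated here as a random variable through the posterior draws of $\theta$, which is exactly the step whose Bayesian status is disputed in this note.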

As argued by the author (Chapter 2, page 21), this "small change" in perspective 
has several appealing features: 

- The approach is general and allows one to resolve the difficulties with the
Bayesian processing of point null hypotheses, being defined solely by the
Bayesian model associated with $L(\theta, x)$;
- The approach allows for the use of generic noninformative and improper
priors, again by being relative to a single model;
- The approach handles more naturally the "vexed question of model fit",
still for the same reason;
- The approach is "simple."



As noted above, the setting is quite similar to Spiegelhalter et al.'s (2002)
DIC in that the deviance $D(\theta) = -2 \log f(x \mid \theta)$ is a renaming
of the likelihood and is considered "a posteriori" both in
$\bar{D} = E[D(\theta) \mid x]$ and in $p_D = \bar{D} - D(\hat\theta)$, where
$\hat\theta$ is a Bayesian estimator of $\theta$, since

$$\mathrm{DIC} = p_D + \bar{D}.$$

The discussion of Spiegelhalter et al. (2002) made this point clear, see in
particular Dawid (2002), even though the authors disagreed. Plummer (2008)
makes a similarly ambiguous proposal that also relates to Geisser and Eddy
(1979) by its usage of cross-validation quantities.
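
The way DIC treats the deviance "a posteriori" can be sketched numerically. The example below is an illustration under our own assumptions, not code from Spiegelhalter et al.: a single binomial observation, a uniform prior, the posterior mean as the plug-in estimator $\hat\theta$, and the hypothetical function names `deviance` and `dic_binomial`.

```python
import math
import numpy as np


def deviance(theta, x, n):
    """D(theta) = -2 log f(x | theta) for a binomial observation x out of n."""
    log_f = (math.log(math.comb(n, x))
             + x * np.log(theta) + (n - x) * np.log1p(-theta))
    return -2.0 * log_f


def dic_binomial(x, n, n_draws=200_000, seed=0):
    """Monte Carlo D_bar, p_D and DIC under an assumed uniform prior."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(x + 1, n - x + 1, size=n_draws)   # posterior draws
    d_bar = float(np.mean(deviance(theta, x, n)))      # posterior mean deviance
    theta_hat = float(np.mean(theta))                  # plug-in estimator
    p_d = d_bar - float(deviance(theta_hat, x, n))     # effective dimension
    return {"D_bar": d_bar, "p_D": p_d, "DIC": p_d + d_bar}
```

Both $\bar{D}$ and $p_D$ are posterior functionals of the likelihood, so the data enter once through the posterior draws and once more through the deviance itself.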

We however dispute both the appropriateness and the magnitude of the 
change advocated in Statistical Inference and show below why, in our opinion, 
this shift in paradigm constitutes a new branch of statistical inference, differing 
from Bayesian analysis on many points. First, using priors and posteriors is no
guarantee that inference is Bayesian (Seidenfeld 1992). Empirical Bayes
techniques are witnesses of this (Robbins 1964; Carlin and Louis 2008). Aitkin's
key departure from Bayesian principles means that his procedure has to be
validated on its own, rather than benefiting from the coherence inherent to
Bayesian procedures. The practical advantage of the likelihood/Bayesian approach
may be convenience, but the drawback is that the method pushes both the user and
the statistician away from progress in model building.^1

We envision Bayesian data analysis as comprising three steps: (1) model 
building, (2) inference, and (3) model checking. In particular, we view steps 
(2) and (3) as separate. Inference works well, with many exciting developments
still to come, handling complex models, leading to an unlimited range of
applications, and a partial integration with classical approaches (as in the
empirical Bayes work of Efron and Morris 1975 or more recently the similarities
between hierarchical Bayes and frequentist false discovery rates discussed by
Efron 2010), causal inference, machine learning, and other aims and methods of
statistical inference.

^1 One might argue that, in practice, almost all Bayesians are subject to our
criticism of "using models that make nonsensical predictions." For example,
Gelman et al. (2003) and Marin and Robert (2007) are full of noninformative
priors. Our criticism here, though, is not of noninformative priors in general
but of incoherent predictions about quantities of interest. In particular,
noninformative priors can often (but not always!) give reasonable inferences
about parameters $\theta$ within a model, even while giving meaningless (or at
least not universally accepted) values for marginal likelihoods that are needed
for Bayesian model comparison. It is when interest shifts from
$\Pr(\theta \mid x, H)$ to $\Pr(H \mid x)$ that the Bayesian must set aside most
noninformative $\pi(\theta \mid H)$ and, perhaps reluctantly, set up an
informative model. See, e.g., Liang et al. (2008) and Johnson and Rossell (2010)
for some current perspectives on Bayesian model choice using noninformative
priors.

Even in the face of all this progress on inference, Bayesian model checking 
remains a bit of an anomaly, with the three leading Bayesian approaches being
Bayes factors, posterior predictive checks, and comparisons of models based on
prediction error and other loss-based measures. (Decision-theoretic analyses as
in Bernardo (2011), while intellectually convincing, have not gained the same
amount of popularity.) Unfortunately, as Aitkin points out, none of these model
checking methods works completely smoothly: Bayes factors depend on aspects of
a model that are untestable and are commonly assigned arbitrarily; posterior
predictive checks are, in general, "conservative" in the sense of producing
p-values whose probability distributions are concentrated near 0.5; and
prediction error measures (which include cross-validation and DIC) require the
user to divide data into test and validation sets, lest they use the data twice
(a point discussed immediately below). The setting is even bleaker when trying
to incorporate noninformative priors (Gelman et al. 2003; Robert 2001) and new
proposals are clearly of interest.



2.2 "Using the data twice"

"A persistent criticism of the posterior likelihood approach (...) has been
based on the claim that these approaches are 'using the data twice,' or are
'violating temporal coherence.'" Statistical Inference, page 48

"Using the data twice" is not our main reservation about the method, if only
because this is a rather vague concept. Obviously, one could criticize the use
of the "posterior expectation" of the likelihood as being the ratio of the
marginal of the twice replicated data over the marginal of the original data,

$$E[L(\theta, x) \mid x] = \int L(\theta, x)\, \pi(\theta \mid x)\, d\theta = \frac{m(x, x)}{m(x)},$$

similar to Aitkin (1991) (a criticism clearly expressed in the discussion
therein). However, a more fundamental issue is that the "posterior"
distribution of the likelihood function cannot be justified from a Bayesian
perspective. Statistical Inference stays away from decision theory (as stated
on page xiv) so there is no derivation based on a loss function or such. Our
primary difficulty with the integrated likelihood idea (and DIC as well) is (a)
that the likelihood function does not exist a priori and (b) that it requires a
joint distribution to be properly defined in the case of model comparison. The
case for (a) is arguable, as Aitkin would presumably contend that there exists
a joint distribution on the likelihood, even though the case of an improper
prior stands out (see below). We still see the concept of a posterior
probability that the likelihood ratio is larger than 1 as meaningless. The case
for (b) is more clear-cut in that when considering two models, hence a
likelihood ratio, a Bayesian analysis does require a joint distribution on the
two sets of parameters to reach a decision, even though in the end only one set
will be used. As detailed below in Section 4, this point is related to the
introduction of pseudo-priors by Carlin and Chib (1995), who needed arbitrarily
defined prior distributions on the parameters that do not exist.
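
The identity $E[L(\theta, x) \mid x] = m(x, x)/m(x)$ can be checked numerically in a conjugate case. The sketch below assumes a binomial observation with a uniform prior, a choice made here purely for illustration; `beta_fn` and `check_data_used_twice` are hypothetical names.

```python
import math
import numpy as np


def beta_fn(a, b):
    """Beta function B(a, b), computed through log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def check_data_used_twice(x, n, n_draws=400_000, seed=0):
    """Return (Monte Carlo E[L | x], exact m(x, x) / m(x)), uniform prior assumed."""
    c = math.comb(n, x)
    m_x = c * beta_fn(x + 1, n - x + 1)                  # m(x): integral of L * prior
    m_xx = c**2 * beta_fn(2 * x + 1, 2 * (n - x) + 1)    # m(x, x): integral of L^2 * prior
    rng = np.random.default_rng(seed)
    theta = rng.beta(x + 1, n - x + 1, size=n_draws)     # posterior draws
    lik = c * theta**x * (1 - theta)**(n - x)
    return float(np.mean(lik)), m_xx / m_x
```

The two returned values should agree up to Monte Carlo error, which makes explicit that the posterior mean of the likelihood is the marginal of the sample replicated twice, divided by the marginal of the original sample.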

In the specific case of an improper prior, Aitkin's approach cannot be
validated in a probability setting for the reason that there is no joint
probability on $(\theta, x)$. Obviously, one could always advance that the
whole issue is irrelevant since improper priors do not stand within probability
theory. However, improper priors do stand within the Bayesian framework, as
demonstrated for instance by Hartigan (1983), and it is easy to give those
priors an exact meaning. When the data are made of $n$ iid observations
$x^n = (x_1, \ldots, x_n)$ from $f_\theta$ and an improper prior $\pi$ is used
on $\theta$, we can consider a training sample (Smith and Spiegelhalter 1982)
$x^{(l)}$, with $(l) \subset \{1, \ldots, n\}$ such that

$$\int f(x^{(l)} \mid \theta)\, d\pi(\theta) < \infty \qquad (l < n).$$

If we construct a probability distribution on $\theta$ by

$$\pi_{x^{(l)}}(\theta) \propto \pi(\theta)\, f(x^{(l)} \mid \theta),$$

the posterior distribution associated with this distribution and the remainder
of the sample $x^{(-l)}$ is given by

$$\pi_{x^{(l)}}(\theta \mid x^{(-l)}) \propto \pi(\theta)\, f(x^n \mid \theta), \qquad x^{(-l)} = \{x_i,\ i \notin (l)\}.$$

This distribution is independent from the choice of the training sample; it
only depends on the likelihood of the whole data $x^n$ and it therefore leads
to a non-ambiguous posterior distribution^2 on $\theta$. However, as is well
known, this construction does not produce a joint distribution on
$(x^n, \theta)$, which would be required to give a meaning to Aitkin's
integrated likelihood. Therefore, his approach cannot cover the case of
improper priors within a probabilistic framework and thus fails to solve the
very difficulty with noninformative priors it aimed at solving. This is further
illustrated by the use of Haldane's prior in Chapter 4 of Statistical
Inference, despite it not allowing for empty cells in a contingency table
(Jeffreys 1939).
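
The independence from the training sample follows in one line from the iid factorisation of the likelihood. Spelled out as a short derivation, using only the notation of this section:

```latex
\begin{align*}
\pi_{x^{(l)}}(\theta \mid x^{(-l)})
  &\propto \pi_{x^{(l)}}(\theta)\, f(x^{(-l)} \mid \theta) \\
  &\propto \pi(\theta)\, f(x^{(l)} \mid \theta)\, f(x^{(-l)} \mid \theta) \\
  &= \pi(\theta)\, f(x^n \mid \theta).
\end{align*}
```

The right-hand side no longer involves $(l)$, so every admissible training sample gives the same posterior on $\theta$. The joint distribution on the data and $\theta$ implied by $\pi_{x^{(l)}}$, by contrast, still depends on which observations were set aside.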



3 Posterior probability on the posterior probabilities

"The p-value is equal to the posterior probability that the likelihood ratio,
for null hypothesis to alternative, is greater than 1 (...) The posterior
probability is p that the posterior probability of $H_0$ is greater than 0.5."
Statistical Inference, pages 42-43

^2 Obvious extensions to the case of independent but non-iid data or of
exchangeable data lead to the same interpretation. The case of dependent data
is more delicate, but a similar interpretation can still be considered.

Those two equivalent statements show that it is difficult to give a Bayesian
interpretation to Aitkin's method, since the two "posterior probabilities"
quoted above are incompatible. Indeed, a fundamental Bayesian property is that
the posterior probability of an event related to the parameters of the model
is not a random quantity but a number. To consider the "posterior probability
of the posterior probability" means we are exiting the Bayesian domain, both
from logical and philosophical viewpoints.

In Chapter 2, Aitkin exposes his (foundational) reasons for choosing this 
new approach by integrated Bayes/likelihood. His criticism of Bayes factors is 
based on several points we feel useful to reproduce here: 

(i) "Have we really eliminated the uncertainty about the model parameters by
integration? The integrated likelihood (...) is the expected value of the
likelihood. But what of the prior variance of the likelihood?" (page 47).

(ii) "Any expectation with respect to the prior implies that the data has not
yet been observed (...) So the "integrated likelihood" is the joint
distribution of random variables drawn by a two-stage process. (...) The
marginal distribution of these random variables is not the same as the
distribution of y (...) and does not bear on the question of the value of
$\theta$ in that population" (page 47).

(iii) "We cannot use an improper prior to compute the integrated likelihood.
This eliminate the usual improper noninformative priors widely used in
posterior inference." (page 47).

(iv) "Any parameters in the priors (...) will affect the value of the
integrated likelihood and this effect does not disappear with increasing sample
size" (page 47).

(v) "The Bayes factor is equal to the posterior mean of the likelihood ratio
between the models" [meaning under the full model posterior] (page 48).

(vi) "The Bayes factor diverges as the prior becomes diffuse. (...) This
property of the Bayes factor has been known since the Lindley/Bartlett paradox
of 1957" (page 48).



The representation (i) of the "integrated" (or marginal) likelihood as an
expectation under the prior

$$m(x) = \int L(\theta, x)\, \pi(\theta)\, d\theta = E^{\pi}[L(\theta, x)]$$

is unassailable and is for instance used as a starting point for motivating the
nested sampling method (Skilling 2006; Chopin and Robert 2010). This does not
imply that the extension to the variance or to any other moment stated in (i)
has a similar meaning, nor that the move to the expectation under the posterior
is valid within the Bayesian paradigm. While the difficulty (iii) with improper
priors is real, and while the impact of the prior modelling (iv) may have a
lingering effect, the other points can be easily rejected on the ground that
the posterior distribution of the likelihood is meaningless within a Bayesian
perspective. This criticism is anticipated by Aitkin, who protests on pages
48-49 that, given point (v), the posterior distribution must be "meaningful,"
since the posterior mean is "meaningful". However, the interpretation of the
Bayes factor as a "posterior mean" is only an interpretation of an existing
integral (in the specific case of nested models); it does not give any
validation to the analysis. (The marginal likelihood may similarly be
interpreted as a prior mean, despite depending on the observation, as in the
nested sampling perspective. More generally, bridge sampling techniques also
exploit those multiple representations of a ratio of integrals; see Gelman and
Meng 1998.) One could just as well take (ii) above as an argument against the
integrated likelihood/Bayes perspective.
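
The prior-mean representation of $m(x)$ is easy to verify by simulation. The following sketch assumes, for illustration only, a binomial likelihood with a uniform prior, for which $m(x) = 1/(n+1)$ exactly; the function name `marginal_likelihood_prior_mean` is hypothetical.

```python
import math
import numpy as np


def marginal_likelihood_prior_mean(x, n, n_draws=400_000, seed=0):
    """Estimate m(x) = E^pi[L(theta, x)] by averaging L over prior draws."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, size=n_draws)   # draws from the assumed uniform prior
    lik = math.comb(n, x) * theta**x * (1 - theta)**(n - x)
    return float(np.mean(lik))
```

Writing the integral as an expectation is what makes this estimator available. It does not by itself give a Bayesian meaning to other moments of the likelihood under the prior or the posterior.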



4 Products of posteriors 

In the case of unrelated models to be compared, the fundamental theoretical 
argument against using posterior distributions of the likelihoods and of related 
terms is that the approach leads to parallel and separate simulations from the 
posteriors under each model. Statistical Inference recommends that models be 
compared via the distribution of the likelihood ratio values,

$$L_i(\theta_i \mid x) / L_k(\theta_k \mid x),$$

where the $\theta_i$'s and $\theta_k$'s are drawn from the respective
posteriors. This choice is similar to Scott's (2002) and to Congdon's (2006)
mistaken solutions exposed in Robert and Marin (2008), in that MCMC simulations
are run for each model separately and the resulting samples are then gathered
together to produce either the posterior expectation (in Scott's, 2002, case)
or the posterior distribution (for the current paper) of

$$p_i L(\theta_i \mid x) \Big/ \sum_k p_k L(\theta_k \mid x),$$

which do not correspond to genuine Bayesian solutions (see Robert and Marin
2008). Again, this is not so much because the dataset $x$ is used repeatedly in
this process (since reversible MCMC produces as well separate samples from the
different posteriors) as because of the fundamental lack of a common joint
distribution that is needed in the Bayesian framework. This means, e.g., that
the integrated likelihood/Bayes technology is producing samples from the
product of the posteriors (a product that clearly is not defined in a Bayesian
framework) instead of using pseudo-priors as in Carlin and Chib (1995), i.e.,
of considering a joint posterior on $(\theta_1, \theta_2)$, which is
[proportional to]

$$p_1 m_1(x)\, \pi_1(\theta_1 \mid x)\, \pi_2(\theta_2) + p_2 m_2(x)\, \pi_2(\theta_2 \mid x)\, \pi_1(\theta_1). \qquad (1)$$

This makes a difference in the outcome, as illustrated in Figure 2, which
compares the distribution of the likelihood ratio under the true posterior and
under the product of posteriors, when assessing the fit of a Poisson model
against the fit of a binomial model with $m = 5$ trials, for the observation
$x = 3$. The joint simulation produces a much more supportive argument in favor
of the binomial model, when compared with the product of the posteriors.
(Again, this is inherently the flaw found in the reasoning leading to Scott's,
2002, and Congdon's, 2006, methods for approximating Bayes factors.)



[Figure 2: two histogram panels, "Marginal simulation" and "Joint simulation"; horizontal axis: log likelihood ratio.]



Figure 2: Comparison of the distribution of the likelihood ratio under the
correct joint posterior and under the product of the model-based posteriors,
when assessing a Poisson model against a binomial with $m = 5$ trials, for
$x = 3$. The joint simulation produces a much more supportive argument in favor
of the binomial model, when compared with the product of the posteriors.
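
The contrast between the two simulation schemes can be sketched as follows. This is not the code behind Figure 2: the priors are not stated in the text, so the sketch assumes an Exp(1) prior on the Poisson mean, a uniform prior on the binomial probability, and equal prior model weights. The names `log_poisson`, `log_binomial` and `simulate` are hypothetical.

```python
import math
import numpy as np

X, M = 3, 5  # observation and number of binomial trials, as in the text


def log_poisson(lam):
    """Log-likelihood of the Poisson model at x = X."""
    return -lam + X * np.log(lam) - math.lgamma(X + 1)


def log_binomial(p):
    """Log-likelihood of the binomial model with M trials at x = X."""
    return math.log(math.comb(M, X)) + X * np.log(p) + (M - X) * np.log1p(-p)


def simulate(n_draws=200_000, seed=0):
    """Return (posterior prob. of Poisson model, log LR draws: product, joint)."""
    rng = np.random.default_rng(seed)
    # Assumed priors: lambda ~ Exp(1), p ~ Uniform(0, 1), equal model weights.
    m1 = 0.5 ** (X + 1)        # marginal likelihood, Poisson model
    m2 = 1.0 / (M + 1)         # marginal likelihood, binomial model
    post_m1 = m1 / (m1 + m2)

    # Product of posteriors: each parameter drawn from its own posterior.
    lam_prod = rng.gamma(X + 1, 0.5, size=n_draws)       # Gamma(x + 1, rate 2)
    p_prod = rng.beta(X + 1, M - X + 1, size=n_draws)
    llr_product = log_poisson(lam_prod) - log_binomial(p_prod)

    # Joint posterior (1): pick a model, then draw its parameter from the
    # posterior and the other model's parameter from its prior.
    is_m1 = rng.random(n_draws) < post_m1
    lam_joint = np.where(is_m1,
                         rng.gamma(X + 1, 0.5, size=n_draws),
                         rng.exponential(1.0, size=n_draws))
    p_joint = np.where(is_m1,
                       rng.random(n_draws),
                       rng.beta(X + 1, M - X + 1, size=n_draws))
    llr_joint = log_poisson(lam_joint) - log_binomial(p_joint)
    return post_m1, llr_product, llr_joint
```

Histograms of `llr_product` and `llr_joint` can then be compared directly. Under these assumed priors the joint draws are far more dispersed, because in each component of (1) one of the two parameters is drawn from its prior rather than from its posterior.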

Although we do not advocate its use, a Bayesian version of Aitkin's proposal
can be constructed based on the following loss function that evaluates the
estimation of the model index $j$ based on the values of the parameters under
both models and on the observation $x$:

$$L(\delta, (j, \theta_j, \theta_{-j})) = \mathbb{1}_{\delta = 1}\, \mathbb{1}_{f_2(x \mid \theta_2) > f_1(x \mid \theta_1)} + \mathbb{1}_{\delta = 2}\, \mathbb{1}_{f_2(x \mid \theta_2) < f_1(x \mid \theta_1)}. \qquad (2)$$

Here $\delta = j$ means that model $j$ is chosen, and $f_j(\cdot \mid \theta_j)$
denotes the likelihood under model $j$. Under this loss, the Bayes (optimal)
solution is

$$\delta^{\pi}(x) = \begin{cases} 1 & \text{if } \Pr^{\pi}[f_2(x \mid \theta_2) < f_1(x \mid \theta_1) \mid x] > 1/2, \\ 2 & \text{otherwise,} \end{cases}$$

which depends on the joint posterior distribution (1) on $(\theta_1, \theta_2)$,
and thus differs from Aitkin's solution. We have

$$\begin{aligned} \Pr^{\pi}[f_2(x \mid \theta_2) < f_1(x \mid \theta_1) \mid x] = {} & \pi(M_1 \mid x) \int \Pr^{\pi_1}[l_1(\theta_1) > l_2(\theta_2) \mid x, \theta_2]\, d\pi_2(\theta_2) \\ & + \pi(M_2 \mid x) \int \Pr^{\pi_2}[l_1(\theta_1) > l_2(\theta_2) \mid x, \theta_1]\, d\pi_1(\theta_1), \end{aligned}$$

where $l_1$ and $l_2$ denote the log-likelihoods and where the probabilities
within the integrals are computed under $\pi_1(\theta_1 \mid x)$ and
$\pi_2(\theta_2 \mid x)$, respectively. (Pseudo-priors as in Carlin and Chib
(1995) could be used instead of the true priors, a requirement when at least
one of those priors is improper.)

An asymptotic evaluation of the above procedure is possible: consider a sample
of size $n$, $x^n$. If $M_1$ is the "true" model, then
$\pi(M_1 \mid x^n) = 1 + o_p(1)$ and we have

$$\Pr^{\pi_1}[l_1(\theta_1) > l_2(\theta_2) \mid x^n, \theta_2] = \Pr[-\chi^2_{p_1} > l_2(\theta_2) - l_1(\hat\theta_1)] + O_p(1/\sqrt{n}) = F_{p_1}[l_1(\hat\theta_1) - l_2(\theta_2)] + O_p(1/\sqrt{n}),$$

with obvious notations for the corresponding log-likelihoods, $p_1$ the
dimension of $\theta_1$, $\hat\theta_1$ the maximum likelihood estimator of
$\theta_1$, and $\chi^2_{p_1}$ a chi-square random variable with $p_1$ degrees
of freedom. Note also that, since $l_2(\theta_2) \le l_2(\hat\theta_2)$,

$$l_1(\hat\theta_1) - l_2(\theta_2) \ge n\, \mathrm{KL}(f_0, f_{\theta_2^*}) + O_p(\sqrt{n}),$$

where $\mathrm{KL}(f, g)$ denotes the Kullback-Leibler divergence and
$\theta_2^*$ denotes the projection of the true model on $M_2$,
$\theta_2^* = \arg\min_{\theta_2} \mathrm{KL}(f_0, f_{\theta_2})$; we have

$$\Pr^{\pi}[l(x^n \mid \theta_2) < l(x^n \mid \theta_1) \mid x^n] = 1 + o_p(1).$$

By symmetry, the same asymptotic consistency occurs under model $M_2$. In
contrast, Aitkin's approach leads (at least in regular models) to the
approximation

$$\Pr[\chi^2_{p_2} - \chi^2_{p_1} > l_2(\hat\theta_2) - l_1(\hat\theta_1)],$$

where the $\chi^2_{p_2}$ and $\chi^2_{p_1}$ random variables are independent,
hence producing quite a different result that depends on the asymptotic
behavior of the likelihood ratio. Note that for both approaches to be
equivalent one would need a pseudo-prior for $M_2$ (resp. $M_1$ if $M_2$ were
true) as tight around the maximum likelihood as the posterior
$\pi_2(\theta_2 \mid x^n)$, which would be equivalent to some kind of empirical
Bayes type of procedure.

Furthermore, in the case of embedded models, $M_1 \subset M_2$, Aitkin's
approach can be given a probabilistic interpretation. To this effect, we write
the parameter under $M_1$ as $(\theta_1, \psi_0)$, $\psi_0$ being a fixed known
quantity, and under $M_2$ as $\theta_2 = (\theta_1, \psi)$, so that comparing
$M_1$ with $M_2$ corresponds to testing the null hypothesis $\psi = \psi_0$.
Aitkin does not impose a positive prior probability on $M_1$, since his prior
only bears on $M_2$ (in a spirit close to the Savage-Dickey representation, see
Marin and Robert 2010). His approach is therefore similar to the inversion of a
confidence region into a testing procedure (or vice versa). Under the model
$M_1 \subset M_2$, denoting by $l(\theta, \psi)$ the log-likelihood of the
bigger model,

$$\begin{aligned} \Pr^{\pi}[l(\theta_1, \psi_0) > l(\theta_1, \psi) \mid x] &\approx \Pr[\chi^2_{p_2 - p_1} > -l(\hat\theta_1(\psi_0), \psi_0) + l(\hat\theta_1, \hat\psi)] \\ &\approx 1 - F_{p_2 - p_1}[-l(\hat\theta_1(\psi_0), \psi_0) + l(\hat\theta_1, \hat\psi)], \end{aligned}$$

which is the approximate p-value associated with the likelihood ratio test.
Therefore, the aim of this approach seems to be, at least for embedded models
where the Bernstein-von Mises theorem holds for the posterior distribution, to
construct a Bayesian procedure reproducing the p-value associated with the
likelihood ratio test. From a frequentist point of view it is of interest to
see that the posterior probability of the likelihood ratio being greater than
one is approximately a p-value, at least in cases when the Bernstein-von Mises
theorem holds, e.g. for embedded models and proper priors. This p-value can
then be given a finite-sample meaning (under the above restrictions); however,
it seems more interesting from a frequentist perspective than from a Bayesian
one.^3 From a Bayesian decision-theoretic viewpoint, this is even more dubious,
since the loss function (2) is difficult to interpret and to justify.
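
The correspondence with the likelihood ratio p-value can be seen exactly in a toy case. The sketch below is our own illustration, with assumptions not taken from the text: iid $N(\psi, 1)$ observations, a flat prior on $\psi$, no nuisance parameter $\theta_1$, and the conventional scaling of the test statistic as twice the log-likelihood difference. The function names are hypothetical.

```python
import math
import numpy as np


def posterior_prob_null_beats_alt(xbar, n, psi0=0.0, n_draws=400_000, seed=0):
    """Monte Carlo Pr[l(psi0) > l(psi) | x] for x_i ~ N(psi, 1), flat prior."""
    rng = np.random.default_rng(seed)
    # Flat prior + normal likelihood => posterior psi ~ N(xbar, 1/n).
    psi = rng.normal(xbar, 1.0 / math.sqrt(n), size=n_draws)

    def loglik(value):
        # Log-likelihood up to an additive constant that cancels in comparisons.
        return -0.5 * n * (xbar - value) ** 2

    return float(np.mean(loglik(psi0) > loglik(psi)))


def lrt_p_value(xbar, n, psi0=0.0):
    """Chi-square(1) tail probability of 2 * [l(psi_hat) - l(psi0)]."""
    stat = n * (xbar - psi0) ** 2
    return math.erfc(math.sqrt(stat / 2.0))
```

For instance, `posterior_prob_null_beats_alt(0.5, 10)` and `lrt_p_value(0.5, 10)` should agree up to Monte Carlo error. In this model the agreement is exact rather than asymptotic, since $n(\psi - \bar{x})^2$ is exactly chi-square with one degree of freedom under the posterior.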

"Without a specific alternative, the best we can do is to make posterior 
probability statements about $\mu$ and transfer these to the posterior
distribution of the likelihood ratio (...) There cannot be strong evidence in
favor of a point null hypothesis against a general alternative hypothesis."
Statistical Inference, pages 42-44

We further note that, once Statistical Inference has set the principle of using 
the posterior distribution of the likelihood ratio (or rather of the divergence 
difference since this is at least symmetric in both hypotheses), there is a whole 
range of outputs available including confidence intervals on the difference, for 
checking whether or not they contain zero. From our (Bayesian) perspective, 
this solution (a) is not Bayesian for reasons exposed above, (b) is not parame- 
terization invariant, and (c) relies once again on an arbitrary confidence level. 

5 Misrepresentations 

We have focused in this review on Aitkin's proposals rather than on his
characterizations of other statistical methods. In a few places, however, we
believe that there have been some unfortunate confusions on his part.

On page 22, Aitkin describes Bayesian posterior distributions as "formally a 
measure of personal uncertainty about the model parameter," a statement that 
we believe holds generally only under a definition of "personal" that is so broad 
as to be meaningless. As we have discussed elsewhere (Gelman, 2008), Bayesian
probabilities can be viewed as "subjective" or "personal" but this is not
necessary. Or, to put it another way, if you want to label my posterior
distribution as "personal" because it is based on my personal choice of prior
distribution, you should also label inferences from the proportional hazards
model as "personal" because they are based on the user's choice of the
parameterization of Cox (1972); you should also label any linear regression
(classical or otherwise) as "personal" as based on the individual's choice of
predictors and assumptions of additivity, linearity, variance function, and
error distribution; and so on for all but the very simplest models in
existence.

^3 See Chapter 7 of Gelman et al. (2003) for a fully Bayesian treatment of
finite-sample inference.

In a nearly century-long tradition in statistics, any probability model is
sharply divided into "likelihood" (which is considered to be objective and, in
textbook presentations, is often simply given as part of the mathematical
specification of the problem) and "prior" (a dangerously subjective entity to
which the statistical researcher is encouraged to pour all of his or her
pent-up skepticism). This may be a tradition but it has no logical basis. If
writers such as Aitkin wish to consider their likelihoods as objective and
consider their priors as subjective, that is their privilege. But we would
prefer them to restrain themselves when characterizing the models of others.
It would be polite to either tentatively accept the objectivity of others'
models or, contrariwise, to gallantly affirm the subjectivity of one's own
choices.

Aitkin also mischaracterizes hierarchical models, writing "It is important
not to interpret the prior as in some sense a model for nature [italics in the
original], that nature has used a random process to draw a parameter value from
a higher distribution of parameter values ..." On the contrary, that is exactly
how we interpret the prior distribution in the ideal case. Admittedly, we do 
not generally approach this ideal (except in settings such as genetics where the 
population distribution of parameters has a clear sampling distribution), just 
as in practice the error terms in our regression models do not capture the true 
distribution of errors. Despite these imperfections, we believe that it can often 
be helpful to interpret the prior as a model for the parameter-generation process 
and to improve this model where appropriate. 

6 Contributions of the book 

Statistical Inference points out several important facts that are individually 
known well (but perhaps not well enough!), but by putting them all in one 
place it foregrounds the difficulty or impossibility of putting all the different 
approaches to model checking in one place. We all know that the p-value is in
no way the posterior probability of a null hypothesis being true; in addition, 
Bayes factors as generally practiced correspond to no actual probability model. 
Also, it is well-known that the so-called harmonic mean approach to calculating 
Bayes factors is inherently unstable, to the extent that in the situations where 
it does "work," it works by implicitly integrating over a space different from 
that of its nominal model. 

Yes, we all know these things, but as is often the case with scientific anoma- 
lies, they are associated with such a high level of discomfort that many re- 
searchers tend to forget the problems or try to finesse them. It is refreshing to 
see the anomalies laid out so clearly. 

At some points, however, Aitkin disappoints. For example, at the end of Sec- 
tion 7.2, he writes: "In the remaining sections of this chapter, we first consider 
the posterior predictive p-value and point out difficulties with the posterior pre- 
dictive distribution which closely parallel those of Bayes factors." He follows up 
with a section entitled "The posterior predictive distribution," which concludes 
with an example that he writes "should be a matter of serious concern [emphasis 
in original] to those using posterior predictive distributions for predictive 
probability statements." 

What is this example of serious concern? It is an imaginary problem in which 
he observes 1 success in 10 independent trials and then is asked to compute the 
probability of getting at most 2 successes in 20 more trials from the same pro- 
cess. Statistical Inference assumes a uniform prior distribution on the success 
probability and yields a predictive probability of 0.447, which, to him, "looks a 
vastly optimistic and unsound statement." Here, we think Aitkin should take 
Bayes a bit more seriously. If you think this predictive probability is unsound, 
there should be some aspect of the prior distribution or the likelihood that is 
unsound as well. This is what Good (1950) called "the device of imaginary 
results." We suggest that, rather than abandoning highly effective methods based 
on predictive distributions, Aitkin should look more carefully at his predictive 
distributions and either alter his model to fit his intuitions, alter his intuitions 
to fit his model, or do a bit of both. This is the value of inferential coherence 
as an ideal. 
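
The predictive probability in this example can be checked directly. The following is a minimal sketch, not code from Statistical Inference: it assumes the uniform prior is written as Beta(1, 1), so that 1 success in 10 trials gives a Beta(2, 10) posterior, and it sums the resulting beta-binomial probabilities of 0, 1, or 2 successes in 20 further trials. The function names beta_fn and predictive_cdf are our own illustrative choices.

```python
from fractions import Fraction
from math import comb, factorial


def beta_fn(a: int, b: int) -> Fraction:
    """Beta function B(a, b) for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def predictive_cdf(k_max: int, m: int, successes: int, trials: int,
                   a: int = 1, b: int = 1) -> Fraction:
    """P(at most k_max successes in m new trials | observed data).

    Assumes a Beta(a, b) prior on the success probability (integers only,
    an assumption made here to keep the arithmetic exact); a = b = 1 is
    the uniform prior of the example.
    """
    post_a = a + successes            # posterior Beta parameters
    post_b = b + trials - successes
    total = Fraction(0)
    for k in range(k_max + 1):
        # Beta-binomial probability of exactly k successes in m trials
        total += comb(m, k) * beta_fn(post_a + k, post_b + m - k) / beta_fn(post_a, post_b)
    return total


# 1 success in 10 trials, then at most 2 successes in 20 more trials
prob = predictive_cdf(k_max=2, m=20, successes=1, trials=10)
print(round(float(prob), 3))  # 0.447
```

Rerunning the calculation with other values of a and b shows how the predictive statement responds to the prior, which is one way to carry out the comparison of model and intuition suggested above.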



7 Solving non-problems 

Several of the examples in Statistical Inference represent solutions to problems 
that seem to us to be artificial or conventional tasks with no clear analogy to 
applied work. 

"They are artificial and are expressed in terms of a survey of 100 individuals 
expressing support (Yes/No) for the president, before and after a 
presidential address (. . .) The question of interest is whether there has been 
a change in support between the surveys (. . .). We want to assess the 
evidence for the hypothesis of equality H1 against the alternative hypothesis 
H2 of a change." Statistical Inference, page 1^7 

Based on our experience in public opinion research, this is not a real question. 
Support for any political position is always changing. The real question is how 
much the support has changed, or perhaps how this change is distributed across 
the population. 

A defender of Aitkin (and of classical hypothesis testing) might respond at 
this point that, yes, everybody knows that changes are never exactly zero and 
that we should take a more "grown-up" view of the null hypothesis, not that the 
change is zero but that it is nearly zero. Unfortunately, the metaphorical inter- 
pretation of hypothesis tests has problems similar to the theological doctrines 
of the Unitarian church. Once you have abandoned literal belief in the Bible, 
the question soon arises: why follow it at all? Similarly, once one recognizes 
the inappropriateness of the point null hypothesis, it makes more sense not to 
try to rehabilitate it or treat it as treasured metaphor but rather to attack our 
statistical problems directly, in this case by performing inference on the change 
in opinion in the population. 

To be clear: we are not denying the value of hypothesis testing. In this 
example, we find it completely reasonable to ask whether observed changes 
are statistically significant, i.e. whether the data are consistent with a null 
hypothesis of zero change. What we do not find reasonable is the statement 
that "the question of interest is whether there has been a change in support." 




Figure 3: (a) Hypothetical graph of presidential approval with discrete jumps; 
(b) presidential approval series (for George W. Bush) showing movement at 
many different time scales. If the approval series looked like the graph on the 
left, then Aitkin's "question of interest" of "whether there has been a change in 
support between the surveys" would be completely reasonable. In the context of 
actual public opinion data, the question does not make sense; instead, we prefer 
to think of presidential approval as a continuously-varying process. 



All this is application-specific. Suppose public opinion was observed to really 
be flat, punctuated by occasional changes, as in the left graph in Figure 3. In 
that case, Aitkin's question of "whether there has been a change" would be 
well-defined and appropriate, in that we could interpret the null hypothesis of 
no change as some minimal level of baseline variation. 

Real public opinion, however, does not look like baseline noise plus jumps, 
but rather shows continuous movement on many time scales at once, as can be 
seen from the right graph in Figure 3, which shows actual presidential approval 
data. In this example, we do not see Aitkin's question as at all reasonable. Any 
attempt to work with a null hypothesis of opinion stability will be inherently 
arbitrary. It would make much more sense to model opinion as a continuously- 
varying process. 

The statistical problem here is not merely that the null hypothesis of zero 
change is nonsensical; it is that the null is in no sense a reasonable approximation 
to any interesting model. The sociological problem is that, from Savage (1954) 
onward, many Bayesians have felt the need to mimic the classical null-hypothesis 
testing framework, even where it makes no sense. Aitkin is unfortunately no 
exception, taking a straightforward statistical question — estimating a time trend 
in opinion — and re-expressing it as an abstracted hypothesis testing problem 
that pulls the analyst away from any interesting political questions. 

8 Conclusion: Why did we write this review? 



"The posterior has a non-integrable spike at zero. This is equivalent to 
assigning zero prior probability to these unobserved values." Statistical 
Inference, page 98 

A skeptical (or even not so skeptical) reader might at this point ask, Why did 
we bother to write a detailed review of a somewhat obscure statistical method 
that we do not even like? Our motivation surely was not to protect the world 
from a dangerous idea; if anything, we suspect our review will interest some 
readers who otherwise would not have heard about the approach (as previously 
illustrated by Robert 2010). 

In 1970, a book such as Statistical Inference could have had a large influence 
in statistics. As Aitkin notes in his preface, there was a resurgence of interest in 
the foundations of statistics around that time, with Lindley, Dempster, Barnard, 
and others writing about the intersections between classical and Bayesian infer- 
ence (going beyond the long-understood results of asymptotic equivalence) and 
researchers such as Akaike and Mallows beginning to integrate model-based and 
predictive approaches to inference. A glance at the influential text of Cox and 
Hinkley (1974) reveals that theoretical statistics at that time was focused on 
inference from independent data from specified sampling distributions (possi- 
bly after discarding information, as in rank-based tests), and "likelihood" was 
central to all these discussions. 

Forty years on, a book on likelihood inference is more of a niche item. Partly 
this is simply part of the growth of the field — with the proliferation of books, 
journals, and online publications, it is much more difficult for any single book 
to gain prominence. More than that, though, we think statistical theory has 
moved away from iid analysis, toward more complex, structured problems. 

That said, the foundational problems that Statistical Inference discusses are 
indeed important and they have not yet been resolved. As models get larger, 
the problem of "nuisance parameters" is revealed to be not a mere nuisance but 
rather a central fact in all methods of statistical inference. As noted above, 
Aitkin makes valuable points — known, but not well-enough known — about the 
difficulties of Bayes factors, pure likelihood, and other superficially attractive 
approaches to model comparison. We believe it is a natural continuation of this 
work to point out the problems of the integrated likelihood approach as well. 

For now, we recommend model expansion, Bayes factors where reasonable, 
cross-validation, and predictive model checking based on graphics rather than 
p-values. We recognize that each of these approaches has loose ends. But, 
as practical idealists, we consider inferential challenges to be opportunities for 
model improvement within the Bayesian realm rather than motivations for a new 
theory of noninformative priors that takes us into uncharted territories. 